feed range query fix#47105
Conversation
(cherry picked from commit cc412f0)
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR updates the Cosmos DB Python SDK query pipeline to make query_items continuation tokens split-safe for feed_range and prefix partition-key queries, and to allow safe continuation round-trips across SDK versions during rolling deployments by emitting legacy vs structured tokens based on the current physical partition overlap.
Changes:
- Introduces a v=1 structured continuation token format for multi-partition-overlap
feed_range/prefix-PK queries, while preserving legacy single-string continuations for full-PK and single-partition scopes. - Refactors sync/async
__QueryFeedfeed-range pagination logic to be split-aware, checkpoint on mid-page failures, and validate token identity (collection/query/feed_range). - Adds/updates extensive unit + end-to-end tests covering multi-partition overlap pagination, post-split resume behavior, checkpointing, and aggregate merge behavior; updates index-metrics assertions to be region/build tolerant.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds bug-fix entries for feed_range split pagination and VALUE AVG multi-partition behavior. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_base.py | Updates query result merging to better classify/handle aggregate partials; adds clearer error for unsupported VALUE AVG merges. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py | Refactors sync __QueryFeed to use structured feed_range continuation and split-safe pagination helpers, plus checkpoint-on-failure. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/base_execution_context.py | Adds per-query response-header capture dict and uses it to resume from checkpoint after 410 split retries. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/base_execution_context.py | Async equivalent of per-query header capture and checkpoint-based 410 retry resume. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_query_aggregate_utils.py | New helpers to detect/identify VALUE aggregates and classify aggregate partial rows. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_routing/feed_range_continuation.py | New shared implementation for structured v=1 token codec, identity hashing, and feed-range pagination helpers. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py | Refactors async __QueryFeed similarly to sync, using shared helpers for split-safe feed_range pagination and checkpointing. |
| sdk/cosmos/azure-cosmos/azure/cosmos/container.py | Passes container properties into the request pipeline to support full-PK feed-range routing logic. |
| sdk/cosmos/azure-cosmos/tests/test_crud_subpartition.py | Adds resume/pagination tests for full and prefix hierarchical partition keys (MultiHash). |
| sdk/cosmos/azure-cosmos/tests/test_crud_subpartition_async.py | Async equivalents of new subpartition resume/pagination tests. |
| sdk/cosmos/azure-cosmos/tests/test_partition_split_retry_unit.py | Adds unit tests validating internal header capture + checkpoint continuation use during 410 retry flows (sync). |
| sdk/cosmos/azure-cosmos/tests/test_partition_split_retry_unit_async.py | Async equivalents of new 410 retry checkpoint/capture unit tests. |
| sdk/cosmos/azure-cosmos/tests/test_query.py | Adds tests asserting full-PK queries emit legacy continuations and resume correctly; relaxes index-metrics assertions. |
| sdk/cosmos/azure-cosmos/tests/test_query_async.py | Async equivalents of full-PK legacy continuation tests; relaxes index-metrics assertions. |
| sdk/cosmos/azure-cosmos/tests/test_query_cross_partition.py | Relaxes index-metrics assertions to tolerate backend variations. |
| sdk/cosmos/azure-cosmos/tests/test_query_feed_range_multipartition.py | New end-to-end tests for multi-partition-overlap feed_range pagination, merge fallback, checkpointing, identity mismatch, and post-split resume. |
| sdk/cosmos/azure-cosmos/tests/test_query_feed_range_multipartition_async.py | Async equivalents of new multi-partition feed_range end-to-end tests. |
|
@sdkReviewAgent-2 |
|
✅ Review complete (11:14) Posted 5 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
@sdkReviewAgent-2 |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
✅ Review complete (07:47) Posted 2 inline comment(s). Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
Testing Gap: No Live Test for SELECT VALUE AVG(...) Raising ValueError Background: The PR's new _query_aggregate_utils.py + _base.py logic detects when a SELECT VALUE AVG(...) query spans multiple physical partitions. Previously, the SDK would silently merge AVG partials by addition — producing mathematically wrong results. Now it raises ValueError instead. What's covered:
What's missing:
The code path that's untested: This only triggers when (a) the query is SELECT VALUE AVG(...) AND (b) the scope spans multiple physical partitions requiring client-side merge. Single-partition or server-handled AVG still succeeds. Why it matters: Without this test, a regression could silently return wrong averages again (the original bug this PR fixes) or fail to surface the documented error. The most dangerous failure mode — returning incorrect data without any error — would be completely invisible. What a proper test would look like:
|
simorenoh
left a comment
There was a problem hiding this comment.
Just one comment on missing tests that would be good to have, LGTM otherwise
|
added the new test |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
kushagraThapar
left a comment
There was a problem hiding this comment.
thanks for adding the tests as well, LGTM, thanks @dibahlfi
|
/azp run python - cosmos - tests |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Summary
This PR makes Python Cosmos SDK query continuation tokens safe to round-trip across SDK versions during a rolling deployment, while fixing the silent split-bug class for feed_range and prefix partition_key queries.
The behavior is governed by a single rule, applied per request based on how many physical partitions the caller's input scope currently maps to in the PKRANGE cache:
• Currently 1 partition → emit the legacy single-string continuation (read and write). Old SDKs can read it; new SDKs can resume it.
• Currently >1 partition → emit the structured v=1 envelope (read and write). Required to safely represent one resume position per partition.
• Inbound legacy token, but the input scope now covers >1 partition (split happened since the token was minted) → restart that input range once, then emit v=1 from that page onward.
Full-PK queries are structurally always single-partition, so they always emit legacy format regardless of inbound — the strongest cross-version guarantee in the SDK.
Token formats
• Legacy — opaque single-string continuation the SDK has emitted for years. One position slot. Readable by any SDK version.
• v=1 envelope — base64-encoded JSON carrying one resume position per partition plus identity hashes. Only the new SDK can read it; old SDKs forward it verbatim and the service rejects it with 400 BadRequest.
Decoded v=1 shape(conceptual):
{
"v": 1,
"qh": "<hash of query text + parameters>",
"cr": "/dbs/.../colls/...",
"frh": "",
"c": [
{ "min": "...", "max": "...", "bc": "" },
...
]
}
qh / cr / frh let the SDK refuse a token replayed against a different query, container, or feed_range. On mismatch the SDK raises ValueError unwrapped to the caller.
Cross-SDK compatibility matrix
Mid-page failure handling:
When the pagination loop is mid-page and a backend POST raises (any HTTP error, network failure, etc.), the SDK stamps a resumable checkpoint into last_response_headers[Continuation] before re-raising the original exception. The checkpoint shape follows the same per-request rule above: legacy if the input range currently maps to one partition (always, for full-PK), v=1 if currently more than one. The caller's by_page() loop sees the exception and can resume on the saved checkpoint.
Rollout safety
• Single-partition input ranges (the common case): zero impact across versions in either direction. Bookmarks stay legacy both ways.
• Full-PK: structurally single-partition forever. Strongest guarantee — persist a bookmark, replay from any SDK version, no 400.
• Multi-partition input ranges: bookmarks are v=1 and only readable by the new SDK. During rolling upgrade, an old pod that picks up a new pod's v=1 token loses that page with a 400 and clears-and-restarts.